Imaged Document Text Retrieval Without OCR
Abstract
We propose a method for text retrieval from document images without the use of OCR. Documents are segmented into character objects. Image features, namely the Vertical Traverse Density (VTD) and Horizontal Traverse Density (HTD), are extracted. An n-gram based document vector is constructed for each document from these features. Text similarity between documents is then measured by calculating the dot product of the document vectors. Testing with seven corpora of imaged textual documents in English and Chinese, as well as images from the UW1 database, confirms the validity of the proposed method.
Index Terms: Document image analysis, document vector, text similarity, text
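The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm: traverse density is taken here to be the number of background-to-ink transitions along each scan line, each character object is assumed to have already been mapped to an integer class label by some feature-matching step (hypothetical, not shown), and document vectors are normalized to unit length so that the dot product behaves like a cosine similarity. The names `traverse_density`, `ngram_vector`, and `similarity` are illustrative only.

```python
from collections import Counter
from math import sqrt


def traverse_density(bitmap, axis):
    """Count background-to-ink transitions along each scan line.

    bitmap is a list of rows of 0/1 pixels (1 = ink).
    axis="vertical" scans each column top to bottom (assumed VTD);
    axis="horizontal" scans each row left to right (assumed HTD).
    """
    lines = zip(*bitmap) if axis == "vertical" else bitmap
    density = []
    for line in lines:
        previous, crossings = 0, 0
        for pixel in line:
            if pixel and not previous:
                crossings += 1  # entered a new stroke
            previous = pixel
        density.append(crossings)
    return density


def ngram_vector(labels, n=3):
    """Build a unit-length sparse vector of n-gram counts.

    labels is a sequence of character-object class labels
    (hypothetical output of matching objects by their VTD/HTD features).
    """
    grams = Counter(tuple(labels[i:i + n]) for i in range(len(labels) - n + 1))
    norm = sqrt(sum(count * count for count in grams.values()))
    if norm == 0:
        return {}
    return {gram: count / norm for gram, count in grams.items()}


def similarity(u, v):
    """Dot product of two sparse document vectors."""
    return sum(weight * v.get(gram, 0.0) for gram, weight in u.items())


# A 3x3 toy glyph shaped like the letter H.
glyph = [
    [1, 0, 1],
    [1, 1, 1],
    [1, 0, 1],
]
vtd = traverse_density(glyph, "vertical")
htd = traverse_density(glyph, "horizontal")

# Two toy documents expressed as class-label sequences.
doc_a = ngram_vector([0, 1, 2, 0, 1, 2], n=2)
doc_b = ngram_vector([3, 4, 5, 3], n=2)
score_same = similarity(doc_a, doc_a)
score_diff = similarity(doc_a, doc_b)
```

Because the vectors are sparse dictionaries keyed by n-gram, only n-grams shared by both documents contribute to the score, so documents with no common label sequences score zero.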
Similar resources
Optical Font Recognition from Projection Profiles
• Recognition of logical document structures [1], where knowledge of the font used in a word, line, or text block may be useful for defining its logical label (chapter title, section title or paragraph). • Document reproduction, where knowledge of the font is necessary in order to reproduce (reprint) the document. • Document indexing and information retrieval, where word indexes are generally p...
Document Image Dewarping Based on Text Line Detection and Surface Modeling (RESEARCH NOTE)
Document images produced by scanner or digital camera, usually suffer from geometric and photometric distortions. Both of them deteriorate the performance of OCR systems. In this paper, we present a novel method to compensate for undesirable geometric distortions aiming to improve OCR results. Our methodology is based on finding text lines by dynamic local connectivity map and then applying a l...
Document Retrieval In OCR-Scanned Text
The use of a document retrieval system (PADRE) for the Fujitsu AP1000 in processing known-item search queries over OCR-scanned documents is reported. Retrieval performance of an initial set of queries is shown to deteriorate significantly over scanned data with a character error rate of 5%. A preprocessor is used to augment queries with terms which can be derived from original terms using charac...
Novel Arabic OCR Degraded Text Retrieval Model
This paper provides a novel model that enhances the effectiveness of Arabic OCR-degraded text retrieval. The model simulates the Arabic OCR recognition mistakes that happen during the recognition process, using a word-based approach. Then, using the expected OCR errors, the model expands the user search query. The resulting expanded search query produced higher precision and recall in searching Arabic OCRDegr...
RMIT University at TREC 2008: Legal Track
This paper reports on the participation of RMIT university in the 2008 TREC Legal Track Ad Hoc task. OCR errors can corrupt the document view formed by an information retrieval system, and substantially hinder the successful retrieval of relevant documents for user queries. In previous research, the presence of errors in OCR text was observed to lead to unstable and unpredictable retrieval effe...
Journal: IEEE Trans. Pattern Anal. Mach. Intell.
Volume: 24
Publication date: 2002